Trend Models

Week 3

Shan Zhang

Old Dominion University

Topics

  1. Forecast Variance

  2. Trend Models

    • Functional Forms
    • Intercept Breaks
    • Trend Breaks

Example: GDP

\[ GDP_t = \alpha + \beta_1i_t + \beta_2u_t + \epsilon_t\] \[ \hat{GDP}_t = \hat{\alpha} + \hat{\beta}_1i_t + \hat{\beta}_2u_t, \ \ \text{for} \ t \in \{0, ..., T\}\]

Suppose you estimate \(\hat{\alpha} = 10\), \(\hat{\beta}_1 = 4\), and \(\hat{\beta}_2 = -2\). Then, use this model for \(t \in \{T+1, ..., T + h\}\).

\[\hat{GDP}_t = 10 + 4i_t - 2u_t\]
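Plugging hypothetical future values into the fitted equation (the values of \(i_t\) and \(u_t\) below are illustrative, not from the example):

```r
# Hypothetical future interest and unemployment rates (illustrative only)
i_future <- c(2.5, 3.0)
u_future <- c(5.0, 4.5)

# Fitted equation: GDP-hat = 10 + 4*i - 2*u
gdp_hat <- 10 + 4 * i_future - 2 * u_future
gdp_hat  # 10 13
```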

Sample Mean

We are almost always after \(\mu\), the population average.

However, we almost always need to settle for \(\overline{x}\):

\[ \overline{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]

Of course, the population of \(x_i\)’s has some variance \(\sigma^2\). Again, we usually estimate this as \(s^2\).

Then, \(\overline{x}\) has a standard deviation (standard error): \(\sqrt{\frac{s^2}{n}}\)

Note: \(\overline{x} - \mu\) has the same variance as \(\overline{x}\)
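Each of these quantities is one line of R; a quick sketch on simulated data:

```r
set.seed(1)
x <- rnorm(100, mean = 5, sd = 2)  # sample of n = 100 draws

n     <- length(x)
x_bar <- mean(x)       # estimate of mu
s2    <- var(x)        # estimate of sigma^2
se    <- sqrt(s2 / n)  # standard error of x-bar
```

The standard error shrinks at rate \(1/\sqrt{n}\) as the sample grows.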

Forecast Variance\(^*\)

If \(b_0\), the sample average, is used as the forecast for \(y_{T+h}\), then the prediction error is the sum of the forecast error and the estimation uncertainty:

\[y_{T+h} - b_0 = e_{T+h} + (\beta_0 - b_0)\]

Then, the variance of the prediction error decomposes into the forecast-error variance and the estimation variance:

\[ var(y_{T+h} - b_0) = \underbrace{var(e_{T+h})}_{\sigma^2} + \underbrace{var(\beta_0 - b_0)}_{\frac{\sigma^2}{T}} \]

Forecast Variance\(^*\)

\[\sigma^2 + \frac{\sigma^2}{T}\]

\[(1 + \frac{1}{T})\sigma^2\]

\[ S_{T+h} = \sqrt{(1 + \frac{1}{T})\sigma^2} \]

Note that this is larger than the standard error of the sample mean itself: \(\sqrt{\frac{1}{T}\sigma^2}\)

So what?

When we construct confidence intervals for a regular mean, we use the following formula:

\[\overline{x} \pm z\sqrt{(\frac{1}{n})s^2} \]

However, for predictions:

\[b_0 \pm z\sqrt{(1 + \frac{1}{T})s^2} \]
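Both interval formulas can be checked against R's `predict()` on simulated data (note that `predict()` uses \(t\) rather than \(z\) critical values, which matters little for large \(T\)):

```r
set.seed(42)
y   <- rnorm(50, mean = 3, sd = 1)
T_n <- length(y)
reg <- lm(y ~ 1)  # constant-only regression: the intercept is the sample mean

b0 <- coef(reg)[[1]]
s2 <- summary(reg)$sigma^2
tc <- qt(0.95, df = T_n - 1)  # two-sided 90% critical value

# Confidence interval for the mean:
ci <- b0 + c(-1, 1) * tc * sqrt(s2 / T_n)

# Prediction interval for a new observation:
pi_ <- b0 + c(-1, 1) * tc * sqrt((1 + 1 / T_n) * s2)

# Both match predict():
predict(reg, newdata = data.frame(row = 1),
        interval = "confidence", level = 0.90)[1, ]
predict(reg, newdata = data.frame(row = 1),
        interval = "prediction", level = 0.90)[1, ]
```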

Calculate Prediction Standard Errors

Read in Personal Consumption Expenditure data.

Code
library("lubridate")
pce <- read.csv("../data/DPCERL1Q225SBEA.csv")
colnames(pce) <- c("date", "pc")

pce$date <- ymd(pce$date)
pce$horizon <- pce$date > ymd("2013-09-01")

plot(pce$date[!pce$horizon],
     pce$pc[!pce$horizon],
     type = "l",
     ylab = "Personal Consumption Expenditure",
     xlab = "Quarter")

Calculate Prediction Standard Errors

Run a regression with only a constant:

Code
reg <- lm(pc ~ 1,
          pce[!pce$horizon,])
summary(reg)

Call:
lm(formula = pc ~ 1, data = pce[!pce$horizon, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-14.8748  -1.8748  -0.0248   1.9252  18.8252 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.3748     0.2102   16.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.427 on 265 degrees of freedom
Code
mean(pce$pc[!pce$horizon])
[1] 3.374812

Calculate Prediction Standard Errors

Confidence Interval:

Code
pred2 <- predict(reg, interval = "confidence", level = 0.90)
pred2[1,]
     fit      lwr      upr 
3.374812 3.027930 3.721695 

Prediction Interval:

Code
pred1 <- predict(reg, interval = "prediction", level = 0.90)
pred1[1,]
      fit       lwr       upr 
 3.374812 -2.293295  9.042919 

Calculate Prediction Standard Errors

Code
plot(pce$date[!pce$horizon],
     pce$pc[!pce$horizon], type = "l",
     ylab = "Personal Consumption Expenditure",
     xlab = "Quarter")
lines(pce$date[!pce$horizon], pred1[,1], col = "tomato")
lines(pce$date[!pce$horizon], pred1[,2], col = "dodgerblue", lty = 2)
lines(pce$date[!pce$horizon], pred1[,3], col = "dodgerblue", lty = 2)

Calculate Prediction Standard Errors

Code
pred_horizon <- predict(reg, newdata = pce[pce$horizon,],
                        interval = "prediction",
                        level = .9)
lim <- !pce$horizon & pce$date >= ymd("2000-01-01")
plot(pce$date[lim],
     pce$pc[lim], type = "l",
     ylim = c(min(pce$pc),
              max(pce$pc)),
     xlim = ymd(c("2000-01-01", "2016-01-01")),
     ylab = "Personal Consumption Expenditure",
     xlab = "Quarter")
lines(pce$date[pce$horizon], pred_horizon[,1], col = "tomato")
lines(pce$date[pce$horizon], pred_horizon[,2], col = "dodgerblue", lty = 2)
lines(pce$date[pce$horizon], pred_horizon[,3], col = "dodgerblue", lty = 2)

PCE vs Air Passengers

Code
par(mfrow = c(1,2))
plot(pce$date[!pce$horizon],
     pce$pc[!pce$horizon],
     type = "l",
     ylab = "Personal Consumption Expenditure",
     xlab = "")
data("AirPassengers")
plot(AirPassengers, ylab = "Air Passengers")

Mean Shift

What if there is a policy introduced?

  • Example: Stop and Frisk in New York City.

Drastic, sharp mean changes (i.e. policy changes, etc.):

  • are difficult to anticipate

  • “ruin” your forecasting model

NYPD Data

Below, we have data on daily counts of both crime complaints and stop & frisk incidents.

Code
crime_snf <- readRDS("../data/crime_snf.RDS")
crime <- crime_snf[[1]]; snf <- crime_snf[[2]]
crime <- crime[crime$date %in% ymd("2007-01-01"):ymd("2016-12-31"),]
snf <- snf[snf$date %in% ymd("2007-01-01"):ymd("2016-12-31"),]

Weekly Crime Complaints

Code
aggregate(list(freq = crime$freq),
          list(date = floor_date(crime$date, "week")),
          sum) -> tmp
plot(tmp$date, tmp$freq,
     xlab = "Week", ylab = "Frequency",
     type = "l")

Weekly Crime Complaints

Code
plot(tmp$date, tmp$freq,
     xlab = "Week", ylab = "Frequency",
     type = "l")
abline(v = ymd("2008-01-01") + years(0:8),
       lty = 2, col = scales::alpha("black", .4))

Weekly Stop & Frisk Incidents

Code
crime2 <- tmp
aggregate(list(freq = snf$freq),
          list(date = floor_date(snf$date, "week")),
          sum) -> tmp
lim <- tmp$date < ymd("2012-05-01")
plot(crime2$date, crime2$freq,
     xlab = "Week", ylab = "Frequency",
     type = "l",
     col = scales::alpha("black", .4),
     ylim = c(min(tmp$freq[lim]), max(tmp$freq[lim])))
lines(tmp$date[lim], tmp$freq[lim],
      col = "dodgerblue", lwd = 2)
abline(v = ymd("2008-01-01") + years(0:8),
       lty = 2, col = scales::alpha("black", .4))

Floyd et al. v. City of New York

  • Jan. 31, 2008: Center for Constitutional Rights files initial complaint
  • Nov. 7, 2008: CCR files Motion for Class Certification
  • May 16, 2012: Judge grants Class Certification Motion
  • Mar 18, 2013: Trial Begins
  • Aug 12, 2013: Judge rules City liable for constitutional violations and orders broad reforms
  • May 6, 2022: NYPD Continues to Underreport Use of Stop and Frisk; Severe Racial Disparities Persist

Weekly Stop & Frisk Incidents

Code
plot(crime2$date, crime2$freq,
     xlab = "Week", ylab = "Frequency",
     type = "l",
     col = scales::alpha("black", .4),
     ylim = c(min(tmp$freq), max(tmp$freq)))
lines(tmp$date[lim], tmp$freq[lim],
      col = "dodgerblue", lwd = 2)
abline(v = ymd("2008-01-01") + years(0:8),
       lty = 2, col = scales::alpha("black", .4))
lines(tmp$date[!lim], tmp$freq[!lim],
      col = "tomato", lwd = 2)

Weekly Stop & Frisk Incidents

Code
lim_time <- crime2$date %in% ymd("2010-01-01"):ymd("2015-01-01")
plot(crime2$date[lim_time], crime2$freq[lim_time],
     xlab = "Week", ylab = "Frequency",
     type = "l",
     col = scales::alpha("black", .4),
     ylim = c(min(tmp$freq), max(tmp$freq)))
lines(tmp$date[lim & lim_time], tmp$freq[lim & lim_time],
      col = "dodgerblue", lwd = 2)
# abline(v = ymd("2008-01-01") + years(0:8),
#        lty = 2, col = scales::alpha("black", .4))
lines(tmp$date[!lim & lim_time], tmp$freq[!lim & lim_time],
      col = "tomato", lwd = 2)
abline(v = ymd(c("2012-05-12", "2013-03-18", "2013-08-13")),
       col = "mediumseagreen")

What are our options?

  1. Scrap “old” data. Estimate on the “new” subsample.
  2. Use dummy variables

Dummy Variables

\[E[y_{t+h}|\Omega_t] = \beta_0 + \beta_1 I(t > t^{*}) \]

How do we interpret \(\beta_1\)?
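\(\beta_1\) is the change in the mean after the break: the post-period mean minus the pre-period mean. A simulated sketch (the break date and magnitudes below are made up for illustration):

```r
set.seed(7)
t_all  <- 1:200
t_star <- 120

# Mean 10 before the break, mean 6 after (a shift of -4)
y    <- ifelse(t_all > t_star, 6, 10) + rnorm(200, sd = 1)
post <- as.numeric(t_all > t_star)

reg <- lm(y ~ post)
coef(reg)
# (Intercept) is the pre-break sample mean; the coefficient on post
# is exactly the difference in sample means across the break:
coef(reg)[["post"]] - (mean(y[post == 1]) - mean(y[post == 0]))  # ~0
```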

Coding Breaks

Code
reg1 <- lm(freq ~ 1, data = tmp)
tmp$post1 <- ifelse(tmp$date > ymd("2012-05-12"), 1, 0)
tmp$post2 <- ifelse(tmp$date > ymd("2013-03-18"), 1, 0)
tmp$post3 <- ifelse(tmp$date > ymd("2013-08-12"), 1, 0)
reg2 <- lm(freq ~ post1 + post2 + post3, data = tmp)
library("stargazer")
stargazer(reg1, reg2, type = "html",
          omit.stat = c("ser", "f"),
          column.labels = "Stop & Frisk Count")
Dependent variable: freq (Stop & Frisk Count)

                        (1)             (2)
post1                             -3,439.785***
                                    (272.580)
post2                             -3,843.540***
                                    (448.532)
post3                             -3,107.412***
                                    (391.836)
Constant            6,840.966***  10,933.900***
                     (221.941)      (101.428)
Observations             522            522
R2                      0.000          0.889
Adjusted R2             0.000          0.888
Note: *p<0.1; **p<0.05; ***p<0.01

Plotting Breaks

Code
plot(tmp$date[lim & lim_time], tmp$freq[lim & lim_time],
     xlab = "Week", ylab = "Frequency",
     type = "l", ylim = c(min(tmp$freq), max(tmp$freq)),
     xlim = c(min(tmp$date[lim_time]),
              max(tmp$date[lim_time])),
     col = "dodgerblue", lwd = 2)
lines(tmp$date[!lim & lim_time], tmp$freq[!lim & lim_time],
      col = "dodgerblue", lwd = 2)
abline(v = ymd(c("2012-05-12", "2013-03-18", "2013-08-13")),
       col = "black")
lines(tmp$date, predict(reg1, tmp), lty = 2,
      lwd = 2, col = "tomato")
lines(tmp$date, predict(reg2, tmp), lty = 3,
      col = "mediumseagreen", lwd = 2)
legend("topright", bty = "n", lty = c(2:3),
       legend = c("Reg 1", "Reg 2"), cex = 1.3,
       col = c("tomato", "mediumseagreen"))

Stop & Frisk Overview

  • Policy makers might have looked at downward trending\(^*\) crime and upward trending Stop & Frisk as an encouraging sign!
    • Donohue & Levitt (2001)
  • However, there were a few very important dates in the court case Floyd, et al. v. City of New York, et al. that provide exogenous variation in Stop & Frisk intensity.

Be(ing) Careful

Mean shifts are difficult to anticipate ex-ante.

How should we select these breaks?

  • Judgement (policy changes, important events, shocks)
  • Information (visual inspection)
  • Data-Based (“p-hacking”? Atheoretical)

Finding Breaks

Code
rsqr <- c()
for(i in 10:(nrow(tmp)-11)){
  tmp$post <- ifelse(tmp$date > (min(tmp$date) + weeks(i)), 1, 0)
  regx <- lm(freq ~ post, tmp)
  rsqr <- c(rsqr, summary(regx)$r.squared)
}
rsqr_plot <- c(rep(NA, 10), rsqr, rep(NA, 10))
max_date <- tmp$date[which(rsqr_plot == max(rsqr_plot, na.rm = TRUE))]
plot(tmp$date, rsqr_plot, ylab = "R-Squared",
     xlab = c("Break Point", paste0("Date Selected: ", max_date)),
     type = "l", lwd = 2,
     ylim = c(0, 1))
lines(tmp$date, tmp$freq/max(tmp$freq),
      col = scales::alpha("black", .4),
      type = "l")
abline(v = max_date, col = "red")

Gradually Changing Breaks

What if the mean is gradually changing over time?

Trend Model: \(y_{t} = f(\text{time}_t)\)

Different Types of Trend Models:

  • Linear
  • Quadratic
  • Exponential

Types of Trend Models

Linear:

\[y_t = \beta_0 + \beta_1 \text{Time}_t\]

Quadratic:

\[y_t = \beta_0 + \beta_1 \text{Time}_t + \beta_2 \text{Time}_t^2\]

Exponential:

\[y_t = e^{\beta_0 + \beta_1 \text{Time}_t} \implies log(y_t) = \beta_0 + \beta_1 \text{Time}_t\]

Example of Linear Trend

Code
labor <- readRDS("../data/labor_particip.RDS")
lim <- labor$date < ymd("1992-01-01")
par(mfrow = c(1,2))
plot(labor$date[lim], labor$men[lim], type = "l",
     xlab = "", ylab = "Participation Rate",
     main = "Men")
plot(labor$date[lim], labor$women[lim], type = "l",
     xlab = "", ylab = "Participation Rate",
     main = "Women")

Example of Quadratic Trend

Code
sales <- readRDS("../data/sales.RDS")
lim <- sales$date < ymd("1993-01-01")
plot(sales$date[lim], sales$sales[lim], type = "l",
     xlab = "", ylab = "Sales")

Example of Exponential Trend

Code
par(mfrow = c(1,2))
sandp <- readRDS("../data/sandp.RDS")
lim <- sandp$date < ymd("1992-01-01")
plot(sandp$date[lim], sandp$volume[lim], type = "l",
     xlab = "", ylab = "S&P Volume")
plot(sandp$date[lim], log(sandp$volume[lim]), type = "l",
     xlab = "", ylab = "Transformed S&P Volume",
     ylim = c(min(log(sandp$volume[lim])),
              max(asinh(sandp$volume[lim]))),
     col = scales::alpha("tomato", .6))
lines(sandp$date[lim], asinh(sandp$volume[lim]),
      col = scales::alpha("dodgerblue", .6))
legend("topleft", legend = c("log()", "asinh()"),
       col = c("tomato", "dodgerblue"),
       lty = 1, bty = "n")

Fitting Trend Models

Code
lim <- labor$date < ymd("1993-01-01")
labor$time <- interval(min(labor$date), labor$date) %/% months(1)

reg1 <- lm(women ~ date, labor[lim,])
reg2 <- lm(women ~ time, labor[lim,])
library("stargazer")
stargazer(reg1, reg2,
          type = "html",
          omit.stat = c("ser", "f"))
Dependent variable: women

                     (1)         (2)
date               0.002***
                  (0.00001)
time                           0.051***
                               (0.0004)
Constant          42.120***   28.705***
                   (0.057)     (0.113)
Observations         540         540
R2                  0.973       0.973
Adjusted R2         0.973       0.973
Note: *p<0.1; **p<0.05; ***p<0.01

Fitting Trend Models

Code
pred <- predict(reg1, interval = "predict", level = 0.90)

plot(labor$date[lim], labor$women[lim], type = "l",
     xlab = "", ylab = "Labor Force Participation")
lines(labor$date[lim], reg1$fitted.values,
      col = "tomato", lty = 2)
lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")

Fitting Trend Models

Code
new_lim <- labor$date %in% ymd("1980-01-01"): ymd("2009-12-01")
new_pred <- predict(reg1, labor[!lim,], interval = "predict", level = 0.90)

plot(labor$date[new_lim & lim],
     labor$women[new_lim & lim],
     type = "l",
     xlab = "", ylab = "Labor Force Participation",
     xlim = c(min(labor$date[new_lim]), max(labor$date[new_lim])),
     ylim = c(min(labor$women[new_lim], new_pred[,2]),
              max(labor$women[new_lim], new_pred[,3])))
lines(labor$date[lim], reg1$fitted.values,
      col = "tomato", lty = 2)

lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")

lines(labor$date[!lim], new_pred[,1], lty = 2)
lines(labor$date[!lim], new_pred[,2], col = "mediumseagreen")
lines(labor$date[!lim], new_pred[,3], col = "mediumseagreen")

Fitting Trend Models

Code
plot(labor$date[new_lim & lim],
     labor$women[new_lim & lim],
     type = "l",
     xlab = "", ylab = "Labor Force Participation",
     xlim = c(min(labor$date[new_lim]), max(labor$date[new_lim])),
     ylim = c(min(labor$women[new_lim], new_pred[,2]),
              max(labor$women[new_lim], new_pred[,3])))
lines(labor$date[lim], reg1$fitted.values,
      col = "tomato", lty = 2)

lines(labor$date[lim], pred[,2], col = "dodgerblue")
lines(labor$date[lim], pred[,3], col = "dodgerblue")

lines(labor$date[!lim], new_pred[,1], lty = 2)
lines(labor$date[!lim], new_pred[,2], col = "mediumseagreen")
lines(labor$date[!lim], new_pred[,3], col = "mediumseagreen")

lines(labor$date[!lim], labor$women[!lim], lwd = 2)

Fitting Trend Models

Code
lim <- sales$date < ymd("1995-01-01")
sales$time <- interval(min(sales$date), sales$date) %/% months(1)

reg1 <- lm(sales ~ time, sales[lim,])
reg2 <- lm(sales ~ time + I(time^2), sales[lim,])

plot(sales$time[lim], sales$sales[lim], type = "l",
     xlab = "", ylab = "Sales")
lines(sales$time[lim], reg1$fitted.values, col = "tomato")
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")

Fitting Trend Models

Code
pred <- predict(reg2, sales[!lim,])
plot(sales$time[lim], sales$sales[lim], type = "l",
     xlab = "", ylab = "Sales",
     xlim = c(min(sales$time), max(sales$time)),
     ylim = c(min(sales$sales, pred),
              max(sales$sales, pred)))
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")
lines(sales$time[!lim], pred, col = "dodgerblue", lty = 2)

Fitting Trend Models

Code
plot(sales$time[lim], sales$sales[lim], type = "l",
     xlab = "", ylab = "Sales",
     xlim = c(min(sales$time), max(sales$time)),
     ylim = c(min(sales$sales, pred),
              max(sales$sales, pred)))
lines(sales$time[lim], reg2$fitted.values, col = "dodgerblue")
lines(sales$time[!lim], sales$sales[!lim], lwd = 2)
lines(sales$time[!lim], pred, col = "dodgerblue", lty = 2)

Fitting Trend Models

Code
lim <- sandp$date < ymd("1990-01-01")
sandp$time <- interval(min(sandp$date), sandp$date) %/% weeks(1)

reg1 <- lm(log(volume) ~ time, sandp[lim,])
reg2 <- lm(asinh(volume) ~ time, sandp[lim,])

par(mfrow = c(1, 2))
plot(sandp$time[lim], sandp$volume[lim], type = "l",
     xlab = "", ylab = "S&P Transactions")
lines(sandp$time[lim], exp(reg1$fitted.values),
      col = scales::alpha("tomato", .4), lwd = 4)
lines(sandp$time[lim], sinh(reg2$fitted.values),
      col = scales::alpha("dodgerblue", .4), lwd = 4)

plot(sandp$time[lim], asinh(sandp$volume[lim]), type = "l",
     xlab = "", ylab = "Transformed S&P Transactions",
     col = scales::alpha("dodgerblue", .6),
     ylim = c(min(log(sandp$volume[lim])),
              max(asinh(sandp$volume[lim]))))
lines(sandp$time[lim], log(sandp$volume[lim]),
     col = scales::alpha("tomato", .6))
lines(sandp$time[lim], reg1$fitted.values,
      col = scales::alpha("tomato", .4), lwd = 4)
lines(sandp$time[lim], reg2$fitted.values,
      col = scales::alpha("dodgerblue", .4), lwd = 4)

Fitting Trend Models

Code
pred <- predict(reg2, sandp[!lim,],
                interval = "prediction", level = 0.9)
plot(sandp$time[lim], sandp$volume[lim], type = "l",
     xlab = "", ylab = "S&P Transactions",
     xlim = c(min(sandp$time), max(sandp$time)),
     ylim = c(0,
              max(sinh(pred[,3]))))
lines(sandp$time[lim], sinh(reg2$fitted.values),
      col = scales::alpha("dodgerblue", .6), lwd = 4)
lines(sandp$time[!lim], sinh(pred[,1]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 2)
lines(sandp$time[!lim], sinh(pred[,2]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 1)
lines(sandp$time[!lim], sinh(pred[,3]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 1)

Fitting Trend Models

Code
plot(sandp$time[lim], sandp$volume[lim], type = "l",
     xlab = "", ylab = "S&P Transactions",
     xlim = c(min(sandp$time), max(sandp$time)),
     ylim = c(0,
              max(sandp$volume, pred[,3])))
lines(sandp$time[!lim], sandp$volume[!lim], lwd = 1)
lines(sandp$time[lim], sinh(reg2$fitted.values),
      col = scales::alpha("dodgerblue", .6), lwd = 4)
lines(sandp$time[!lim], sinh(pred[,1]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 2)
lines(sandp$time[!lim], sinh(pred[,2]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 1)
lines(sandp$time[!lim], sinh(pred[,3]),
      col = scales::alpha("dodgerblue", .6),
      lwd = 4, lty = 1)

Issues with Trend Models

  1. Trends will understate uncertainty.

  2. Uncertainty will increase with time horizon.

    • It is important to choose an appropriate horizon and sample length.
    • Consider quadratic fits as an example of this.
  3. Forecasters usually predict growth rates, and then convert back to levels.

  4. There is no economic theory behind trends!
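On point 3, one hazard is the conversion step itself: exponentiating a log-trend forecast understates the conditional mean. A sketch on a simulated exponential trend (the lognormal correction \(e^{\hat{\sigma}^2/2}\) is a standard adjustment assumed here, not taken from the slides):

```r
set.seed(3)
time_t <- 1:120
# Exponential trend: log(y) is linear in time
y <- exp(0.5 + 0.01 * time_t + rnorm(120, sd = 0.05))

reg <- lm(log(y) ~ time_t)
new <- data.frame(time_t = 121:132)

# Naive back-transformation (understates the mean by Jensen's inequality):
lvl_naive <- exp(predict(reg, new))

# Lognormal correction using the residual variance:
s2      <- summary(reg)$sigma^2
lvl_adj <- exp(predict(reg, new) + s2 / 2)
```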

Evaluating Forecasts

MAE - Mean Absolute Error: \(\frac{1}{T}\sum_{t = 1}^{T}|e_t|\)

RMSE - Root Mean Square Error: \(\sqrt{\frac{1}{T}\sum_{t = 1}^{T} e_t^2}\)

MAPE - Mean Absolute Percent Error: \(\frac{100}{T}\sum_{t = 1}^{T} \frac{|e_t|}{|y_t|}\)

Next Class

  • Autocorrelation
  • White Noise
  • Moving Averages
  • Exam Review

Programming

* vs :
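Assuming this bullet refers to R's formula operators: `a:b` adds only the interaction term, while `a*b` expands to main effects plus the interaction.

```r
# In model formulas, a*b is shorthand for a + b + a:b
df <- data.frame(y = rnorm(20), a = rnorm(20), b = rnorm(20))

m1 <- lm(y ~ a:b, df)    # interaction only
m2 <- lm(y ~ a * b, df)  # a, b, and a:b

names(coef(m1))  # "(Intercept)" "a:b"
names(coef(m2))  # "(Intercept)" "a" "b" "a:b"
```

(Outside of formulas, `:` is instead the sequence operator, as in `1:10`.)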

Functions

  • rmse()
  • mape()
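One possible implementation of these helpers (the signatures, and scaling MAPE by 100 to express a percent, are assumptions):

```r
# Root mean square error of a vector of forecasts
rmse <- function(actual, forecast) {
  sqrt(mean((actual - forecast)^2))
}

# Mean absolute percent error, in percentage points
mape <- function(actual, forecast) {
  mean(abs((actual - forecast) / actual)) * 100
}

rmse(c(1, 2, 3), c(1, 2, 5))    # sqrt(4/3) ~ 1.155
mape(c(100, 200), c(110, 180))  # 10
```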